
Self-host Mermaid + aggregator HTML quality fixes + scripted article minimums #1998

Merged
pethers merged 4 commits into main from copilot/improve-article-generation-templates
Apr 25, 2026


Copilot AI commented Apr 25, 2026

Mermaid self-host (already shipped on this branch)

  • Mermaid pinned in package.json (mermaid@11.4.1), vendored to js/lib/mermaid/ by scripts/copy-vendor-mermaid.ts during prebuild/predev
  • Loader (js/lib/mermaid-init.mjs) imports from same-origin relative path
  • CI guard tests/no-external-cdn.test.ts blocks any external CDN host in js/ or news/*.html
  • Article-Generation.md + SECURITY_ARCHITECTURE.md updated to reflect self-hosted Mermaid
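
The CDN guard described above can be pictured as a plain text scan over file contents. The following is a minimal sketch and not the contents of tests/no-external-cdn.test.ts: the host list, the `findExternalCdnRefs` name, and the vendored file name in the sample strings are assumptions made for illustration.

```typescript
// Hypothetical host list; the guard's real list is not shown in this PR text.
const FORBIDDEN_CDN_HOSTS: readonly string[] = [
  'cdn.jsdelivr.net',
  'unpkg.com',
  'cdnjs.cloudflare.com',
];

interface CdnHit {
  line: number; // 1-based line number within the scanned text
  host: string; // forbidden host found on that line
}

// Return every forbidden host that appears in `text`, with its line number.
function findExternalCdnRefs(
  text: string,
  hosts: readonly string[] = FORBIDDEN_CDN_HOSTS,
): CdnHit[] {
  const hits: CdnHit[] = [];
  text.split(/\r?\n/).forEach((content, index) => {
    const lower = content.toLowerCase();
    for (const host of hosts) {
      if (lower.includes(host)) {
        hits.push({ line: index + 1, host });
      }
    }
  });
  return hits;
}

// Sample inputs (file names are assumed, not taken from the repository).
const sameOrigin = "import mermaid from './mermaid/mermaid.esm.min.mjs';";
const external =
  "import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.esm.min.mjs';";

console.log(findExternalCdnRefs(sameOrigin).length); // 0
console.log(findExternalCdnRefs(external).length);   // 1
```

A test runner would apply a scan like this to each file under js/ and each news/*.html file and fail when any hit is returned.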

Article quality — Phase 1

  • Aggregator HTML quality fixes (heading demotion, _Source: preamble strip, ## Article Sources appendix, slug normalisation)
  • scripts/validate-article.ts minimum-content validator (9 rule codes) wired into npm run validate-all
  • Doc updates in Article-Generation.md + analysis/templates/README.md
  • Re-aggregated 27 articles, re-rendered 54 HTML files
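
Heading demotion, the first of the aggregator fixes listed above, can be sketched as a line-by-line rewrite of ATX headings. This is an illustration only: the `demoteHeadings` name, the `levels` parameter, and the choice to skip fenced code blocks are assumptions, not the aggregator's actual code.

```typescript
// Demote every ATX heading in `body` by `levels`, capping at H6.
// Lines inside fenced code blocks are left untouched.
function demoteHeadings(body: string, levels: number): string {
  let inFence = false;
  return body
    .split('\n')
    .map((line) => {
      // A fence opener or closer toggles the "inside code" state.
      if (/^\s{0,3}(`{3}|~{3})/.test(line)) {
        inFence = !inFence;
        return line;
      }
      if (inFence) return line;
      const m = /^(#{1,6})(\s+.*)$/.exec(line);
      if (!m) return line;
      const depth = Math.min(6, m[1]!.length + levels);
      return '#'.repeat(depth) + m[2]!;
    })
    .join('\n');
}

console.log(demoteHeadings('## Document summary\ntext', 1));
// ### Document summary
// text
```

Capping at H6 keeps deeply nested source headings valid instead of emitting seven or more `#` characters.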

CI fix — HTMLHint id-unique regression (this commit)

The Phase-1 heading demotion exposed a latent slugger-dedup bug:

  • ### 📜 Sources slugged via github-slugger to -sources; my post-trim collapsed it to sources but the slugger's internal state still recorded -sources. A later ### Sources therefore got sources (not sources-1) and the rendered HTML had two id="rm-sources" attributes, failing HTMLHint's id-unique rule on 10 files / 16 errors.
  • Fix: pre-clean the heading text (strip leading non-letter / non-number characters via /^[^\p{L}\p{N}]+/u) before passing to slugger.slug(), so the slugger sees the cleaned form and its dedup-suffix state stays consistent. Same fix applied to aggregator.ts#anchorForTitle so Reader Intelligence Guide anchors agree.
  • Added 2 regression tests in tests/render-lib.test.ts reproducing the exact CI failure (📜 Sources + Sources, 🔒 Confidence Profile + Confidence Profile).
  • Validation:
    • htmlhint *.html news/*.html: 2798 files, 0 errors (was 2784 files, 16 errors in 10 files)
    • Vitest: 2109 / 2109 pass (was 2107; +2)
    • npm run validate-article: 27 / 27 pass
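
The dedup desynchronisation described above can be reproduced without any dependencies. In the sketch below, `MiniSlugger` is a hypothetical stand-in that imitates only the occurrence-counting behaviour of github-slugger (its character stripping is simplified), and the two helper functions are illustrative names, not the functions in scripts/render-lib/markdown.ts.

```typescript
// Hypothetical stand-in for github-slugger: lowercases, drops emoji and
// punctuation, turns each whitespace character into a hyphen, and appends
// -1, -2, ... when a slug has been handed out before.
class MiniSlugger {
  private occurrences = new Map<string, number>();

  slug(text: string): string {
    const base = text
      .toLowerCase()
      .replace(/[^\p{L}\p{N}\s-]/gu, '')
      .replace(/\s/g, '-');
    let result = base;
    while (this.occurrences.has(result)) {
      const n = (this.occurrences.get(base) ?? 0) + 1;
      this.occurrences.set(base, n);
      result = `${base}-${n}`;
    }
    this.occurrences.set(result, 0);
    return result;
  }
}

// Buggy order: slug first, trim leading hyphens afterwards. The slugger has
// recorded "-sources", so it never learns that "sources" is already taken.
function idsWithPostTrim(headings: string[]): string[] {
  const slugger = new MiniSlugger();
  return headings.map((h) => 'rm-' + slugger.slug(h).replace(/^-+/, ''));
}

// Fixed order: strip leading non-letter / non-number characters first, so
// the slugger records the same string that ends up in the id attribute.
function idsWithPreClean(headings: string[]): string[] {
  const slugger = new MiniSlugger();
  return headings.map(
    (h) => 'rm-' + slugger.slug(h.replace(/^[^\p{L}\p{N}]+/u, '')),
  );
}

const sample = ['📜 Sources', 'Sources'];
console.log(idsWithPostTrim(sample)); // [ 'rm-sources', 'rm-sources' ]
console.log(idsWithPreClean(sample)); // [ 'rm-sources', 'rm-sources-1' ]
```

The pre-clean variant works because the slugger's internal record and the emitted id are derived from the same input, so the dedup suffix is applied when the second heading arrives.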

Out of scope (follow-up phases)

  • Template-side journalist scaffolding (📰 Reader Lede, 🧾 Key Facts Strip, pull quotes, timeline JSON)
  • Rendered-HTML quality gate (Article-Generation #20)
  • Per-type editorial briefs in news-*.md workflows

@pethers pethers marked this pull request as ready for review April 25, 2026 15:43
Copilot AI review requested due to automatic review settings April 25, 2026 15:43
@github-actions bot added the documentation, dependencies, security, html-css, javascript, translation, isms, iso-27001, nist-csf, cis-controls, performance, testing, refactor, news, and size-xl labels Apr 25, 2026
@github-actions

🏷️ Automatic Labeling Summary

This PR has been automatically labeled based on the files changed and PR metadata.

Applied Labels: documentation,dependencies,security,html-css,javascript,translation,isms,iso-27001,nist-csf,cis-controls,performance,testing,refactor,size-xl,news

Label Categories

  • 🗳️ Content: news, dashboard, visualization, intelligence
  • 💻 Technology: html-css, javascript, workflow, security
  • 📊 Data: cia-data, riksdag-data, data-pipeline, schema
  • 🌍 I18n: i18n, translation, rtl
  • 🔒 ISMS: isms, iso-27001, nist-csf, cis-controls
  • 🏗️ Infrastructure: ci-cd, deployment, performance, monitoring
  • 🔄 Quality: testing, accessibility, documentation, refactor
  • 🤖 AI: agent, skill, agentic-workflow

For more information, see .github/labeler.yml.

@github-actions

🔍 Lighthouse Performance Audit

| Category | Score | Status |
| --- | --- | --- |
| Performance | 85/100 | 🟡 |
| Accessibility | 95/100 | 🟢 |
| Best Practices | 90/100 | 🟢 |
| SEO | 95/100 | 🟢 |

📥 Download full Lighthouse report

Budget Compliance: Performance budgets enforced via budget.json

@pethers

pethers commented Apr 25, 2026

@copilot analyse and fix issues in Quality Checks / html-validation (pull_request)
Quality Checks / html-validation (pull_request): Failing after 41s


Copilot AI left a comment


Pull request overview

This PR hardens the rendering and validation pipeline for generated political-intelligence articles by self-hosting Mermaid (no external CDN), improving aggregator-produced HTML structure (heading hierarchy, anchors, sources), and introducing a scripted minimum-content CI gate for aggregated article.md outputs.

Changes:

  • Vendor Mermaid from node_modules/ into js/lib/mermaid/ and add CI coverage to prevent external CDN references in runtime JS and rendered articles.
  • Improve aggregation/render quality by demoting in-body headings, stripping _Source: …_ preambles, normalizing slugs, and appending a single ## Article Sources section.
  • Add scripts/validate-article.ts and wire it into validate-all to enforce required article landmarks and placeholder/BLUF/per-doc constraints.

Reviewed changes

Copilot reviewed 16 out of 91 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/render-lib.test.ts | Adds unit tests for heading demotion, source-preamble stripping, and stable anchor generation. |
| tests/no-external-cdn.test.ts | New CI guard scanning js/** and news/*.html for forbidden CDN hosts; validates Mermaid loader is same-origin. |
| scripts/validate-article.ts | New hard validator enforcing minimum content/structure rules for aggregated article.md. |
| scripts/render-lib/markdown.ts | Normalizes heading slug IDs to avoid rm--* anchors when headings start with stripped characters. |
| scripts/render-lib/aggregator.ts | Implements heading demotion + source preamble stripping + per-section source comments + appends ## Article Sources. |
| scripts/copy-vendor-mermaid.ts | New script to copy Mermaid runtime .mjs assets into js/lib/mermaid/. |
| package.json | Pins mermaid@11.4.1, adds predev/copy-vendor/validate-article, and wires validation into validate-all. |
| package-lock.json | Lockfile updates reflecting Mermaid and its dependency tree. |
| js/lib/mermaid-init.mjs | Switches Mermaid import from jsDelivr to local vendored path via import.meta.url. |
| analysis/templates/README.md | Documents the script-enforced reader-facing output contract and validator rule codes. |
| analysis/daily/2026-04-21/realtime-1353/article.md | Regenerated aggregated article reflecting new heading/source/appendix rules. |
| analysis/daily/2026-04-21/evening-analysis/article.md | Regenerated aggregated article reflecting new heading/source/appendix rules. |
| analysis/daily/2026-04-20/evening-analysis/article.md | Regenerated aggregated article reflecting new heading/source/appendix rules. |
| SECURITY_ARCHITECTURE.md | Updates CSP/local-hosting note to include Mermaid and references the new CI guard. |
| Article-Generation.md | Updates documentation to describe the new cleaning rules, sources appendix, validator, and vendored Mermaid flow. |
| .gitignore | Ignores vendored Mermaid output directory (and includes an updated comment section). |

Comment on lines +439 to +443
function stripSourcePreamble(body: string): string {
  return body
    .replace(/^_\s*Source:\s*\[?`[^\n]*?\n+/gim, '')
    .replace(/^_\s*Source:\s*[^\n]*_\s*$\n?/gim, '');
}

Copilot AI Apr 25, 2026

stripSourcePreamble() leaves a leading blank line when stripping the bare _Source: file.md_ form because the second regex only consumes a single newline (\n?). This makes the unit test case (_Source: synthesis-summary.md_\n\nbody) return \nbody and will fail CI; consider consuming all following newlines (\n+) or trimming start after removal.
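
One way to act on this suggestion is to remove the preamble line together with every newline that follows it, so the result does not depend on how many blank lines separate the preamble from the body. The sketch below is an assumption about how that could look, not the code that was merged: `stripSourcePreambleSketch` is a hypothetical name, and it treats any line beginning with `_Source:` as a preamble, which is broader than the two patterns in the hunk.

```typescript
// Remove each line that starts with `_Source:` plus all newlines after it.
function stripSourcePreambleSketch(body: string): string {
  return body.replace(/^_[ \t]*Source:[^\n]*\n*/gim, '');
}

console.log(
  JSON.stringify(stripSourcePreambleSketch('_Source: synthesis-summary.md_\n\nbody')),
); // "body"
```

Because the pattern consumes the trailing newlines itself, no separate trim step is needed after the replacement.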

Comment on lines +152 to +172
// The aggregator emits one `### <dok_id>` per per-document analysis,
// where `<dok_id>` is a riksdagen identifier such as `HD12345` or
// `FiU17`. After in-body heading demotion (`### Document summary`,
// `### Classification`, …) every other `### …` heading inside the
// section body is *content*, not a new per-document boundary.
// We therefore split only on H3 headings whose text matches a
// dok_id-shaped token — everything between two such headings (or
// from the last one to end-of-region) is one section's body.
const DOK_ID_HEADING = /^###\s+(H[A-Z0-9]{6,10}|[A-ZÅÄÖ]{1,4}\d{4,8})\s*$/m;
let cursor = region;
// First pass — anchor to the first dok_id heading.
let m = cursor.match(DOK_ID_HEADING);
while (m && m.index !== undefined) {
  const id = m[1]!;
  const after = cursor.slice(m.index + m[0].length);
  const next = after.match(DOK_ID_HEADING);
  const body = next && next.index !== undefined ? after.slice(0, next.index) : after;
  sections.push({ id, body });
  if (!next || next.index === undefined) break;
  cursor = after.slice(next.index);
  m = cursor.match(DOK_ID_HEADING);

Copilot AI Apr 25, 2026

extractPerDocumentSections() only treats H3 headings that are a single dok_id token as section boundaries. The corpus already contains composite per-document headings like ### HD01CU27\-CU28 (e.g. analysis/daily/2026-04-17/realtime-1434/article.md), which this regex won’t match—so those sections won’t be validated for dok_id citations. Consider broadening DOK_ID_HEADING to accept composite IDs (hyphen/slash separated, and optional markdown escapes) or splitting on any H3 that starts with a dok_id token.
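
A broadened heading pattern along the lines suggested here might look like the sketch below. It reuses the single-token alternatives from the hunk and allows `-` or `/` separated suffixes, each optionally preceded by a markdown backslash escape. The names `COMPOSITE_DOK_ID_HEADING` and `splitPerDocumentSections`, the suffix character class, and the use of `[ \t]` instead of `\s` are assumptions made for illustration, not the validator's actual code.

```typescript
// One dok_id-shaped token, using the alternatives quoted in the review hunk.
const DOK_ID_TOKEN = String.raw`(?:H[A-Z0-9]{6,10}|[A-ZÅÄÖ]{1,4}\d{4,8})`;

// Hypothetical broadened pattern: a token followed by zero or more suffixes
// separated by "-" or "/", where the separator may be backslash-escaped.
const COMPOSITE_DOK_ID_HEADING = new RegExp(
  String.raw`^###[ \t]+(${DOK_ID_TOKEN}(?:\\?[-/][A-ZÅÄÖ0-9]+)*)[ \t]*$`,
  'gm',
);

interface DocSection {
  id: string;   // heading text with markdown escapes removed
  body: string; // everything up to the next dok_id heading (or end of region)
}

function splitPerDocumentSections(region: string): DocSection[] {
  // Fresh RegExp per call so the global lastIndex state is never shared.
  const re = new RegExp(COMPOSITE_DOK_ID_HEADING.source, 'gm');
  const marks: { id: string; start: number; end: number }[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(region)) !== null) {
    marks.push({
      id: m[1]!.replace(/\\/g, ''),
      start: m.index,
      end: m.index + m[0].length,
    });
  }
  return marks.map((mark, i) => ({
    id: mark.id,
    body: region.slice(
      mark.end,
      i + 1 < marks.length ? marks[i + 1]!.start : region.length,
    ),
  }));
}

const region = [
  '### HD01CU27\\-CU28',
  'Composite section.',
  '### Classification',
  'Still part of the composite section.',
  '### HD12345',
  'Single-id section.',
].join('\n');

console.log(splitPerDocumentSections(region).map((s) => s.id));
// [ 'HD01CU27-CU28', 'HD12345' ]
```

Content headings such as `### Classification` still fall through as body text, so the per-document boundary rule described in the hunk's comments is preserved.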

… to keep dedup state consistent

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/36e3fe38-a1d5-4826-9bb8-dc66c0df589b

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI commented Apr 25, 2026

@copilot analyse and fix issues in Quality Checks / html-validation (pull_request)
Qualit...

Root cause: my Phase-1 post-trim of leading hyphens in heading slugs desynchronised github-slugger's internal dedup state. When ### 📜 Sources slugged to -sources and I trimmed it to sources, the slugger still had -sources recorded — so a later ### Sources got sources instead of sources-1, producing two id="rm-sources" and 16 HTMLHint id-unique errors across 10 files.

Fix in c5d7d6d: pre-clean the heading text (strip leading non-letter/number chars via /^[^\p{L}\p{N}]+/u) before passing to slugger.slug(), so the slugger sees the cleaned form and its dedup state stays consistent. Same fix applied to aggregator.ts#anchorForTitle. Added 2 regression tests reproducing the exact CI case (📜 Sources+Sources, 🔒 Confidence Profile+Confidence Profile).

Local results: htmlhint now scans 2798 files / 0 errors; vitest 2109/2109 pass; npm run validate-article 27/27 pass.

Copilot AI requested a review from pethers April 25, 2026 15:54
@pethers pethers merged commit 8cdbdba into main Apr 25, 2026
22 checks passed
@pethers pethers deleted the copilot/improve-article-generation-templates branch April 25, 2026 15:58